ROCm 및 HIP: GPU 성능의 메모리 중심성에 대한 상세한 10장 강의

GPU 가속에서는 '계산 우선' 사고방식을 버려야 합니다. 현대의 성능은 메모리 관리: 호스트(CPU)와 장치(GPU) 사이의 데이터 할당, 동기화, 최적화를 조율하는 것.

1. 메모리-계산 격차

GPU의 산술 처리량($TFLOPS$)이 급격히 증가했지만, 메모리 대역폭($GB/s$)은 훨씬 느린 속도로 증가했습니다. 이로 인해 실행 단위들이 종종 '결핍' 상태에 빠지며, VRAM에서 데이터가 도착하기를 기다리는 경우가 많습니다. 결과적으로, GPU 프로그래밍은 종종 메모리 프로그래밍입니다.

2. 루프라인 모델

이 모델은 산술 밀도 (FLOPs/Byte)와 성능 간의 관계를 시각화합니다. 응용 프로그램은 일반적으로 두 가지 범주로 나뉩니다:

메모리 제약형: 대역폭으로 제한됨(급경사 구간).
계산 제약형: 정점 TFLOPS로 제한됨(수평 천장).

3. 데이터 이동의 비용

주요 성능 저하 원인은 거의 수학 계산이 아니라, 데이터를 PCIe 버스를 통해 또는 HBM에서 이동시키는 지연 시간과 에너지 소비입니다. 고성능 코드는 데이터의 위치 유지에 중점을 두고, 호스트-장치 간 전송 횟수를 최소화합니다.

TERMINALbash — 80x24

> Ready. Click "Run" to execute.

QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residence causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.